ERA5 Shaped Allocation for Hourly Matching: How Much Do You Need to Know?

Nathan Iyer

Working Paper — February 2025

For flat (baseload) consumption profiles, hourly matching scores—the key metric for granular energy certificate markets—depend primarily on the cumulative distribution of generation values rather than on which specific hours are windy. This insight, validated across 26 UK wind farms and a 134-turbine Chinese continental site, means that ERA5 reanalysis data can reconstruct accurate hourly generation profiles for certificate allocation even without metered data. With a manufacturer-specific power curve (Tier 3), shaped allocation achieves a mean matching score error of 2.8 [2.0–3.8] percentage points (pp, i.e., absolute difference in matching score expressed as a percentage; 95% bootstrap CI), with 70% of farms within ±3 pp and 87% within ±5 pp. An averaged manufacturer curve for the offshore sector (leave-one-out validated) achieves 2.6 [2.0–3.3] pp—closing ~96% of the gap between Tier 0 (7.4 pp) and the manufacturer curve (2.4 pp) without any turbine-specific knowledge. Results use flat (baseload) load profiles throughout; commercial, residential, and industrial profiles show qualitatively similar patterns.

Spatial and temporal robustness are surprisingly strong. Using wind data from 100 km away adds less than 0.5 pp error. Using a prior year’s wind data (with near-zero hourly correlation to the actual year) produces only 1–4 pp error, while naive multi-year hour-by-hour averaging produces 10 [8.7–11.0] pp error because it compresses the generation distribution. A preliminary test on the SDWPF Chinese continental dataset suggests that Tier 3 accuracy transfers cross-geography (±4.5 pp), though this single-site result does not establish generalization.

Distribution-preserving averaging methods completely eliminate the naive averaging penalty: Weibull parameter averaging achieves 1.9 [1.1–2.8] pp with 3 years and 1.7 [1.2–2.2] pp with 10+ years—surpassing concurrent-year accuracy (2.3 pp). Duration curve averaging and quantile mapping achieve ~2.2 pp. Naive averaging worsens to 16 pp at 20 years, demonstrating that more data helps distribution-preserving methods and hurts time-domain methods. Using monthly rather than annual production totals improves accuracy by ~0.4 pp. For practitioners: knowing the turbine’s power curve reduces allocation error from 7 pp to 3 pp, while knowing the exact location or year barely matters—because matching scores depend primarily on the shape of the generation distribution rather than on which hours are windy.


1. Introduction

1.1 The Hourly Matching Problem

The transition toward 24/7 carbon-free energy (CFE) and granular energy certificates requires tracking not just how much renewable energy a generator produces, but when it produces it. The hourly matching score—defined as the fraction of load that can be covered by generation in each hour—has become a key metric for procurement and regulatory compliance. The EU Delegated Regulation 2023/1184 requires hourly temporal correlation for renewable hydrogen (RFNBO) additionality claims, with full hourly matching mandated from 2030 (European Commission 2023). More broadly, the Renewable Energy Directive III (European Parliament 2023) strengthens guarantees of origin, though it does not itself mandate hourly granularity for general certificates. The EnergyTag GC Scheme Standard V2 (2024) specifies hourly metered data as the primary basis for granular certificate issuance, with provisions for modeled data under defined conditions. Google’s 24/7 CFE methodology (Google 2021) introduced the hourly CFE% metric now adopted by corporate buyers worldwide, using settlement-quality metered data rather than modeled profiles. These frameworks create demand for hourly generation profiles but do not specify accuracy thresholds for modeled data—a gap this study addresses.

However, obtaining verified hourly generation profiles for individual wind farms presents a significant practical barrier. While total annual or monthly production is widely reported (e.g., through ENTSO-E, Elexon, or the EIA), hourly profiles are often proprietary, inconsistently formatted, or simply unavailable.

1.2 What Does “Good Enough” Mean?

No regulatory framework specifies an explicit accuracy threshold for modeled hourly profiles. We adopt ±3 pp as the primary benchmark for three reasons: (a) 3 pp on a typical 60% matching score represents ~5% relative error; (b) this is well within load-side measurement uncertainty and metering granularity effects; and (c) it is sufficient to distinguish meaningfully different generation profiles for procurement decisions. We also report ±5 pp as a secondary threshold.

1.3 Shaped Allocation

Shaped allocation offers a solution: if you know a generator’s total production over a period and can model a plausible hourly shape, you can reconstruct the hourly profile by scaling the modeled shape to match the known total:

\[P_{\text{shaped}}(t) = P_{\text{model}}(t) \times \frac{\sum_{t} P_{\text{metered}}(t)}{\sum_{t} P_{\text{model}}(t)}\]

This approach requires:

  1. Weather data to drive a generation model (here, ERA5 reanalysis wind data)
  2. A power curve to convert wind speeds to generation
  3. A known total (annual, quarterly, or monthly) to calibrate the scale

The practical question is: How much information do you need to make this work?

1.4 Research Questions

We address seven questions through systematic sensitivity analysis:

  1. Power curve sensitivity (Tier 0–3): How much does knowing the turbine’s power curve matter? Can technology-class average curves close the gap?
  2. Spatial sensitivity: How much error does using wind data from a nearby (but wrong) location introduce?
  3. Temporal sensitivity: Can historical average wind profiles substitute for concurrent-year data?
  4. Scaling period sensitivity: How much does monthly vs. annual production reporting help?
  5. Cross-geography generalization: Do results transfer from UK offshore to Chinese continental sites?
  6. Averaging method sensitivity: If multi-year historical data must be combined, does the averaging method matter?
  7. Historical window length: How many years of data are needed, and does more data always help?

1.5 Contributions

This study provides the first systematic validation of ERA5 shaped allocation specifically for hourly matching score accuracy—the metric that matters for granular certificate markets. A substantial body of work has validated ERA5 for wind resource assessment: Olauson (2018) identified ERA5 as a step-change improvement over previous reanalyses for wind power modeling; Ramon et al. (2019) compared global reanalyses for near-surface wind representation; Staffell & Pfenninger (2016) demonstrated bias-corrected reanalysis techniques for simulating wind power output; Gualtieri (2022) assessed reanalysis reliability against tall tower measurements; Gruber et al. (2022) provided multi-country validation of wind power simulations from MERRA-2 and ERA5; Hayes et al. (2021) developed long-term offshore wind generation models; Davidson & Millstein (2022) documented limitations of reanalysis data for wind power applications; Peña-Sánchez et al. (2025) validated ERA5 wind speeds globally; and Gandoin & Garza (2024) identified systematic underestimation of strong offshore winds in ERA5. These studies focus on energy yield estimation, capacity factor prediction, or wind speed accuracy. We show that matching score accuracy has fundamentally different sensitivities than these traditional metrics, leading to counterintuitive findings about which information matters most.


2. Data and Methods

2.1 ERA5 Reanalysis Data

ERA5 is the fifth-generation atmospheric reanalysis dataset produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), providing hourly estimates of atmospheric variables on a 0.25° × 0.25° (~27 km) global grid from 1940 to present (Hersbach et al. 2020). The effective spatial resolution is approximately 60–80 km—coarser than the grid spacing—due to spectral truncation in the underlying model. This distinction is relevant to interpreting the spatial sensitivity results in Section 3.2: nearby farms within the same ERA5 grid cell share identical wind data, and even farms 30–60 km apart may experience partial grid aliasing.

We extract hourly data via Google Earth Engine (collection ECMWF/ERA5/HOURLY) at each farm location, computing:

  - Wind speed at 100 m and 10 m from u/v components
  - Wind shear exponent: \(\alpha = \ln(ws_{100}/ws_{10}) / \ln(100/10)\)
  - Air density: \(\rho = P / (R_d \times T)\) (from surface pressure and 2 m temperature; used for IEC 61400-12-1 wind speed normalization: \(v_{\text{norm}} = v \times (\rho / \rho_{\text{ref}})^{1/3}\), applied before the power curve evaluation)
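
As a concrete illustration, the sketch below computes these derived quantities from hourly ERA5 arrays with NumPy. The function and variable names (u100, v100, sp, t2m, etc.) are illustrative assumptions, not the names used in the actual extraction pipeline.

```python
import numpy as np

R_D = 287.05      # specific gas constant for dry air, J/(kg K)
RHO_REF = 1.225   # IEC reference air density, kg/m^3

def derived_quantities(u100, v100, u10, v10, sp, t2m):
    """Wind speed, shear exponent, air density, and IEC 61400-12-1
    density-normalized wind speed from hourly ERA5 arrays.

    u100, v100, u10, v10 : wind components (m/s)
    sp  : surface pressure (Pa)
    t2m : 2 m temperature (K)
    """
    ws100 = np.hypot(u100, v100)
    ws10 = np.hypot(u10, v10)

    # Power-law shear exponent from the 100 m / 10 m speed ratio
    alpha = np.log(ws100 / ws10) / np.log(100.0 / 10.0)

    # Air density from the ideal gas law
    rho = sp / (R_D * t2m)

    # Density normalization applied before the power curve lookup
    ws_norm = ws100 * (rho / RHO_REF) ** (1.0 / 3.0)
    return ws100, alpha, rho, ws_norm
```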

Data is extracted using parallelized time-slice queries (60 chunks per year, 20 concurrent workers) to stay within GEE’s 2,000-point-per-query limit, yielding 8,760–8,784 hourly records per site-year.

2.2 Metered Datasets

| Dataset | Location | N Farms | Terrain | Resolution | Rated Capacity | Year |
|---|---|---|---|---|---|---|
| Dryad North Sea | UK | 24 | Offshore | 30-min | 50–1,218 MW | 2020 |
| Kelmarsh | UK | 1 (6 turbines) | Flat onshore | 10-min SCADA | 12.3 MW | 2020 |
| Penmanshiel | Scotland | 1 (14 turbines) | Complex onshore | 10-min SCADA | 28.7 MW | 2020 |
| SDWPF | China | 1 (134 turbines) | Continental | 10-min SCADA | 201 MW | 2020–21 |

Dryad North Sea: Half-hourly MWh production for 31 UK offshore wind farms from the Dryad data repository. After excluding 4 farms without 2020 production data (Galloper, HornseaTwo, MorayEast, Seagreen), 2 farms without 2020 metered data (SheringhamShoals, TritonKnoll), and 1 commissioning-year anomaly (Kincardine: only 4,152 MWh from 50 MW capacity), we validated 24 offshore farms. Farm specifications (coordinates, hub heights, turbine models) were compiled from 4C Offshore. Hub heights range from 75 m (Barrow) to 113 m (Hornsea One), bracketing the ERA5 100 m reference height.

Kelmarsh and Penmanshiel: Open-access wind farm SCADA datasets from Zenodo, providing turbine-level 10-minute data aggregated to farm-level hourly totals.

SDWPF: The Spatial Dynamic Wind Power Forecasting dataset from the KDD Cup 2022, featuring 134 identical Sinovel SL1500/82 turbines on a plateau in continental China (~1,400 m elevation). Uniquely, this dataset includes ERA5 weather data alongside the SCADA records. ERA5 wind data for SDWPF is at 10 m above ground level, requiring power-law extrapolation to the 70 m hub height—unlike the UK offshore datasets where ERA5 100 m wind is used directly.

2.3 Power Curve Models

We implement four tiers of power curve knowledge:

Tier 0 (Generic): A parameterized generic curve assuming 350 W/m² specific power, using ERA5 wind at 100 m (no hub height extrapolation). This represents the minimum information scenario.

Tier 1 (Hub Height): Same generic curve but with wind speed extrapolated from ERA5 100 m to actual hub height using the power law.

Tier 2 (Specific Power): A generic curve parameterized with the farm’s actual specific power (W/m²), with hub height extrapolation.

Tier 3 (Manufacturer Curve): The actual manufacturer power curve for each turbine model (8 models implemented: Siemens SWT-3.6-107, SWT-3.6-120, SWT-6.0-154, SWT-7.0-154, Vestas V90-3.0, MHI Vestas V164-8.0, V164-9.5, SG 8.0-167, plus Senvion MM82/MM92 for onshore and Sinovel SL1500 for SDWPF). Gaussian smoothing (\(\sigma = 0.5\) m/s) is applied to simulate farm-level aggregation effects.
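
A minimal sketch of the two ends of this spectrum is shown below: a generic specific-power-parameterized cubic curve (Tiers 0–2) and a tabulated manufacturer curve with Gaussian smoothing (Tier 3). The cut-in/cut-out speeds and the aerodynamic efficiency used to place the rated wind speed are illustrative assumptions; the paper specifies only the 350 W/m² default and the \(\sigma = 0.5\) m/s smoothing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

RHO_REF = 1.225  # kg/m^3

def generic_curve_cf(ws, specific_power=350.0, cut_in=3.0, cut_out=25.0):
    """Generic cubic power curve (Tiers 0-2), parameterized by specific power.

    Rated wind speed follows from P = 0.5 * rho * Cp * v^3 per unit rotor area;
    the Cp value and cut-in/cut-out speeds here are illustrative assumptions.
    """
    cp = 0.45  # assumed aerodynamic efficiency below rated
    v_rated = (2.0 * specific_power / (RHO_REF * cp)) ** (1.0 / 3.0)
    cf = np.clip((ws / v_rated) ** 3, 0.0, 1.0)          # cubic ramp, then plateau
    cf[(ws < cut_in) | (ws > cut_out)] = 0.0              # outside operating range
    return cf

def manufacturer_curve_cf(ws, curve_ws, curve_power, rated_power, sigma=0.5):
    """Tier 3: interpolate a tabulated manufacturer curve, smoothed with a
    Gaussian kernel (sigma in m/s) to approximate farm-level aggregation."""
    grid = np.arange(0.0, 30.01, 0.1)                     # fine wind-speed grid
    power = np.interp(grid, curve_ws, curve_power, left=0.0, right=0.0)
    power = gaussian_filter1d(power, sigma=sigma / 0.1)   # sigma converted to grid steps
    return np.interp(ws, grid, power) / rated_power
```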

2.4 Shaped Allocation Procedure

For each farm and tier:

  1. Extract hourly ERA5 wind speed at the farm location
  2. Optionally extrapolate to hub height (Tiers 1–3)
  3. Apply power curve to obtain modeled hourly capacity factor
  4. Convert to MWh: \(P_{\text{model}}(t) = CF(t) \times P_{\text{rated}}\)
  5. Scale to match metered total: \(P_{\text{shaped}}(t) = P_{\text{model}}(t) \times [\sum P_{\text{metered}} / \sum P_{\text{model}}]\)
  6. Calculate matching scores against four load profiles
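
A minimal sketch of steps 1–5 for a single farm-year, assuming hourly NumPy arrays and a power-curve callable like those sketched in Section 2.3 (names are illustrative):

```python
import numpy as np

def shaped_allocation(ws100, alpha, metered_total_mwh, rated_mw,
                      power_curve_cf, hub_height=None):
    """Shaped allocation for one farm-year.

    ws100 : hourly ERA5 100 m wind speed (m/s)
    alpha : hourly shear exponent (used only if hub_height is given)
    power_curve_cf : callable mapping wind speed -> capacity factor
    Returns the shaped hourly MWh profile.
    """
    ws = ws100.copy()
    if hub_height is not None:                     # Tiers 1-3: power-law extrapolation
        ws = ws * (hub_height / 100.0) ** alpha

    cf = power_curve_cf(ws)                        # modeled hourly capacity factor
    p_model = cf * rated_mw                        # modeled hourly MWh (1 h timestep)

    scale = metered_total_mwh / p_model.sum()      # calibrate to the known total
    return p_model * scale
```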

2.5 Matching Score Metric

The hourly matching score is:

\[\text{Match} = \frac{\sum_{t=1}^{8760} \min(G(t), L(t))}{\sum_{t=1}^{8760} L(t)}\]

where \(G(t)\) is hourly generation and \(L(t)\) is hourly load. We test four load profiles:

  - Flat (24/7): Constant baseload (data center)
  - Commercial: Weekday day-peak, weekend trough
  - Residential: Evening peak pattern
  - Industrial: Three-shift pattern

The matching score error is: \(\text{Error} = \text{Match}_{\text{shaped}} - \text{Match}_{\text{metered}}\) (percentage points).
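
In code, the metric and its error reduce to a few lines (a sketch, assuming hourly MWh arrays of equal length):

```python
import numpy as np

def matching_score(gen, load):
    """Hourly matching score: fraction of load covered hour by hour."""
    return np.minimum(gen, load).sum() / load.sum()

def matching_error_pp(shaped, metered, load):
    """Matching score error in percentage points (shaped minus metered)."""
    return 100.0 * (matching_score(shaped, load) - matching_score(metered, load))

# Example: a flat (baseload) profile sized to the farm's mean metered output
# load = np.full_like(metered, metered.mean())
# err = matching_error_pp(shaped, metered, load)
```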

2.6 Secondary Metrics

| Metric | Description |
|---|---|
| Pearson r | Hourly correlation between shaped and metered profiles |
| RMSE | Root mean square error of hourly capacity factor |
| Duration curve RMSE | Error in the generation value distribution |
| Diurnal bias | Systematic hour-of-day errors |

3. Results

3.1 Baseline Validation: Power Curve Sensitivity

All results in this section use the flat (24/7 baseload) load profile unless otherwise noted. Kincardine is excluded from all tiers due to commissioning-year anomalies (only 4,152 MWh from 50 MW capacity). Table 1 summarizes results across 26 UK farms (Tiers 0–2) and the 23-farm subset with manufacturer curves (Tier 3):

| Tier | Information Known | N Farms | Mean Error (pp) [95% CI] | Within ±3 pp [95% CI] | Within ±5 pp [95% CI] | Pearson r |
|---|---|---|---|---|---|---|
| 0 | Location + total only | 26 | 7.3 [6.4–8.3] | 4% [0–12%] | 19% [4–35%] | 0.89 |
| 1 | + Hub height | 26 | 7.9 [6.9–8.9] | 4% [0–12%] | 12% [0–23%] | 0.89 |
| 2 | + Specific power | 26 | 8.8 [7.8–9.8] | 0% [0–0%] | 4% [0–12%] | 0.88 |
| 3 | + Manufacturer curve | 23 | 2.8 [2.0–3.8] | 70% [48–87%] | 87% [74–100%] | 0.90 |

Table 1b shows Tier 3 results across all four load profiles:

| Load Profile | Mean Error (pp) | Within ±3 pp | Within ±5 pp |
|---|---|---|---|
| Flat (24/7) | 2.8 | 70% | 87% |
| Commercial | 1.7 | 78% | 96% |
| Residential | 2.6 | 65% | 91% |
| Industrial | 2.7 | 70% | 91% |

The flat profile is the most conservative case because it weights all hours equally, placing maximum sensitivity on the generation CDF. Time-varying profiles are 0.1–1.1 pp more forgiving: commercial loads (weekday daytime peak) systematically avoid high-error nighttime hours when ERA5 wind overestimates generation. The fact that all four profiles produce similar accuracy (within ~1 pp) provides empirical support for the CDF-based explanation—but the improvement for commercial loads shows that temporal structure does matter at the margin for non-flat profiles. Flat load results are used throughout as the conservative benchmark.

Key finding: The manufacturer power curve is the dominant factor. Tier 3 reduces mean absolute error from 7.3 pp (Tier 0) to 2.8 pp—a 62% reduction. Tiers 1 and 2 provide no improvement and actually worsen results, for reasons discussed below. Three farms lack manufacturer curve data (Aberdeen, Ormonde, Rampion), hence the reduced Tier 3 sample size.

Why Tiers 1 and 2 degrade accuracy: Hub heights in our offshore dataset range from 75 m (Barrow) to 113 m (Hornsea One). Of the 24 offshore farms, 12 have hub heights below 90 m—where power-law extrapolation from ERA5’s 100 m reference reduces the effective wind speed, exacerbating the generic curve’s tendency to produce too-peaky profiles. Five farms have hub heights near 100 m (90–105 m), where extrapolation has minimal effect. Seven farms have hub heights above 105 m, where extrapolation increases wind speed and may push values into the flat region of the power curve, causing saturation. In all cases, the ERA5-derived shear exponent (computed from the 100 m / 10 m wind speed ratio) carries substantial uncertainty—particularly offshore where stability-dependent boundary layer effects produce shear profiles that the power law poorly represents (Davidson & Millstein 2022). Any extrapolation amplifies this uncertainty without improving the underlying generic curve shape.

For Tier 2, using the correct specific power with a generic cubic curve parameterization shifts the curve’s rated wind speed. At high-specific-power sites (>350 W/m²), this pushes rated wind speed higher, making the profile more peaky—the opposite of the intended improvement. This counterintuitive result underscores that the shape of the power curve matters more than its parameterization.

Systematic negative bias at Tier 0: The generic curve produces too-peaky generation profiles (concentrating generation in fewer high-wind hours), systematically understating matching scores by 5–12 pp. This is because a single generic curve cannot capture the diversity of cut-in speeds, rated wind speeds, and plateau shapes across different turbine models.

Unmodeled operational effects: The 2.8 pp Tier 3 error implicitly includes unmodeled curtailment, wake losses, and availability effects that are present in metered data but absent from the ERA5-based model. Systematic curtailment during high-wind periods (e.g., grid congestion or negative pricing events) would compress the upper tail of the metered duration curve. The effect on matching score accuracy is ambiguous: compression of high-generation hours could either increase or decrease the matching score depending on whether those hours exceed load. The shaped allocation’s scaling step absorbs symmetric losses but not temporal patterns in curtailment.

Figure 1. Tier 0 (generic cubic) vs. Tier 3 (manufacturer curve) shaped allocation for 24 UK offshore wind farms. Each panel shows metered (blue) vs. shaped (orange) hourly generation for one representative week.
Figure 2. Matching score error by farm at Tier 0. Negative errors indicate the shaped profile understates the matching score (too-peaky generation distribution).
Figure 3. Matching score error across four load profiles (flat, commercial, residential, industrial) for Tier 3 farms. Flat is the most conservative case.
Figure 4. Duration curves (generation sorted by rank) for shaped vs. metered profiles at four representative farms. The CDF match determines matching score accuracy.

3.2 Spatial Sensitivity

We test spatial robustness through cross-farm ERA5 substitution: for each of 22 offshore farms, we compute the shaped allocation using ERA5 wind data from every other farm (462 cross-farm comparisons plus 22 self-comparisons).

| Distance | N Pairs | Additional Error (pp) [95% CI] | Correlation Drop |
|---|---|---|---|
| Self (0 km) | 22 | — | — |
| 1–30 km | 30 | +0.2 [−0.4 to +0.8] | 0.004 |
| 30–60 km | 20 | +0.1 [−0.4 to +0.7] | 0.032 |
| 60–100 km | 52 | +0.5 [−0.2 to +1.1] | 0.067 |
| 100–200 km | 44 | +0.2 [−0.5 to +0.9] | 0.160 |
| 200–500 km | 274 | +0.9 [+0.6 to +1.1] | 0.325 |

Grid cell sharing caveat: The 22 farms occupy only 16 unique ERA5 grid cells (0.25° resolution). In the 1–30 km bin, 14 of 30 pairs (47%) share the same grid cell and thus have identical ERA5 wind data. Beyond 30 km, all pairs have independent grid cells. The spatial robustness result at <30 km is therefore partly trivial; the more meaningful result is that error remains <0.5 pp at 60–100 km, where all pairs have genuinely independent ERA5 data.

Key finding: Matching score error is remarkably robust to spatial displacement. Using wind data from 100 km away adds less than 0.5 pp of error. In contrast, Pearson correlation degrades more sharply. This divergence occurs because matching scores depend on the statistical distribution of generation values (how many hours are at high vs. low output), not on their specific timing. Wind speed distributions are spatially coherent over larger scales than hour-to-hour wind patterns.

Practical implication: For granular certificate purposes, approximate coordinates (within ~100 km) are sufficient.

Figure 5. Spatial sensitivity: additional matching score error, correlation drop, and degradation rate as a function of inter-farm distance.

3.3 Temporal Sensitivity

We test whether historical wind data can substitute for concurrent-year data using 5 representative farms with ERA5 from 2017–2020 (Table 3):

| Wind Data Source | Mean Error (pp) [95% CI] | Hourly Correlation (r) |
|---|---|---|
| Concurrent year (2020) | 2.3 [1.3–3.1] | 0.940 |
| Prior year (2018) | 1.1 [0.6–1.7] | 0.095 |
| Prior year (2019) | 2.0 [1.0–2.9] | 0.081 |
| Prior year (2017) | 4.0 [2.6–5.3] | 0.095 |
| Multi-year average (2017–19)¹ | 10.0 [8.7–11.0] | 0.170 |

¹ We use “multi-year average” rather than the industry term “P50” throughout. Standard P50 estimation involves 10–20 years of data with measure-correlate-predict (MCP) correction, which differs from the naive hour-by-hour averaging tested here.

Key finding—counterintuitive: Using a wrong year’s wind data produces better matching scores than using a naive multi-year average. Prior years have near-zero hourly correlation (r ≈ 0.08) but maintain natural wind variability, yielding only 1–4 pp error. The multi-year average, by contrast, smooths hour-to-hour variability, producing an artificially flat generation profile that systematically overstates matching scores by ~10 pp.

Mechanistic explanation: Matching score is determined by the cumulative distribution function (CDF) of generation values, not by which specific hours have high or low output. A single year’s wind data—even from the wrong year—preserves the correct wind speed distribution (Weibull shape), producing a generation CDF that closely matches reality. Multi-year averaging narrows the distribution, reducing the variance of hourly generation, and causing more hours to exceed load—inflating the matching score.
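
A small synthetic illustration of this mechanism, using an assumed Weibull wind climate and a toy cubic power curve rather than the paper's validation data: averaging three independent years hour by hour compresses the generation distribution and inflates the flat matching score, even after rescaling to the same annual total.

```python
import numpy as np

rng = np.random.default_rng(0)
k, lam, hours, years = 2.3, 10.0, 8760, 3        # illustrative Weibull wind climate

def cf(ws):                                       # toy curve: cubic to rated at 11 m/s
    return np.clip((ws / 11.0) ** 3, 0.0, 1.0) * (ws < 25.0)

wind = lam * rng.weibull(k, size=(years, hours))  # independent synthetic years
gen_single = cf(wind[0])                          # one year, full variability
gen_naive = cf(wind.mean(axis=0))                 # naive hour-by-hour wind average
gen_naive *= gen_single.sum() / gen_naive.sum()   # rescale to the same annual total

load = np.full(hours, gen_single.mean())          # flat load sized to mean output
match = lambda g: np.minimum(g, load).sum() / load.sum()

print(f"CF std   single year: {gen_single.std():.3f}   naive avg: {gen_naive.std():.3f}")
print(f"matching single year: {match(gen_single):.3f}   naive avg: {match(gen_naive):.3f}")
```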

Practical implication: For matching score estimation, any single year of ERA5 data is preferable to a naive multi-year average. The best approach remains using the concurrent year, but if unavailable, a recent individual year is far better than naive averaging. However, as we show in Section 3.6, distribution-preserving averaging methods can recover the accuracy lost by naive averaging.

Figure 6. Temporal sensitivity: matching score error using concurrent year, prior individual years, and naive multi-year average wind data.

3.4 Scaling Period Sensitivity

We test matching accuracy when production totals are known at annual, quarterly, or monthly resolution (23 Tier 3 farms, excluding Kincardine, flat load profile):

| Scaling Period | Mean Error (pp) [95% CI] | Within ±3 pp [95% CI] | Within ±5 pp [95% CI] | Pearson r |
|---|---|---|---|---|
| Annual | 2.8 [2.0–3.8] | 70% [52–87%] | 87% [70–100%] | 0.90 |
| Quarterly | 2.5 [1.7–3.5] | 74% [57–91%] | 91% [78–100%] | 0.92 |
| Monthly | 2.4 [1.6–3.3] | 78% [61–91%] | 91% [78–100%] | 0.92 |

Key finding: Monthly scaling provides moderate improvement (~0.4 pp mean, +8 percentage points in ±3 pp compliance) over annual. The biggest gain comes from annual to quarterly (captures seasonal wind patterns). Beyond quarterly, diminishing returns set in because intra-month variability is not captured by scaling.
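
A sketch of per-period scaling with pandas, assuming an hourly modeled profile and metered totals indexed by period start (the actual implementation may differ):

```python
import pandas as pd

def shaped_with_period_scaling(p_model, metered_totals, freq="MS"):
    """Scale the modeled hourly profile with one multiplicative factor per period.

    p_model        : pd.Series of modeled hourly MWh with a DatetimeIndex
    metered_totals : pd.Series of metered MWh totals indexed by period start
    freq           : "YS" annual, "QS" quarterly, "MS" monthly
    """
    model_totals = p_model.resample(freq).sum()
    factors = metered_totals / model_totals             # one factor per period
    return p_model * factors.reindex(p_model.index, method="ffill")
```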

Special case—partial production years: For commissioning or decommissioning years, quarterly or monthly scaling is critical. One farm (Kincardine) in its commissioning year showed 28.6 pp error with annual scaling but only 4.5 pp with quarterly—because sub-annual totals correctly represent the periods when the farm was operational.

Figure 7. Scaling period sensitivity: annual vs. quarterly vs. monthly production total calibration for Tier 3 farms.

3.5 Cross-Geography: SDWPF China

As a preliminary test of cross-geography applicability, we validate against the SDWPF dataset—a 134-turbine Chinese continental farm at ~1,400 m elevation. This single-site test cannot establish generalization but provides initial evidence of transferability beyond the UK North Sea.

| Tier | 2021 Flat Error (pp) | 2021 r |
|---|---|---|
| 1 (generic, +extrapolation) | -12.3 | 0.687 |
| 2 (SP=284, +extrapolation) | -10.5 | 0.712 |
| 3 (Sinovel curve, +extrapolation) | +4.5 | 0.760 |

Note: The SDWPF dataset bundles ERA5 wind at 10 m only; in practice, a user would fetch ERA5 100 m data directly, requiring minimal extrapolation to the 70 m hub height. The results above use 10 m → 70 m power-law extrapolation—a pessimistic scenario that overstates the extrapolation uncertainty a real-world user would face.

Key differences from UK offshore:

  1. Lower correlation: r = 0.76 (vs. 0.89–0.95 for UK offshore), reflecting greater micro-scale terrain variability in continental settings that ERA5’s ~60–80 km effective resolution cannot capture.
  2. Manufacturer curve still essential: Tier 3 reduces error from 12 pp (generic) to 4.5 pp, confirming the importance of turbine-specific information across geographies.
  3. Matching score is reasonable despite low correlation: Despite much lower hourly correlation, Tier 3 achieves within ±5 pp accuracy—further evidence that matching accuracy depends on distribution shape rather than timing.
  4. Results are likely pessimistic: With ERA5 100 m data (available in any real deployment), extrapolation from 100 m down to 70 m would introduce far less uncertainty than the 10 m → 70 m extrapolation imposed by this dataset.

Figure 8. SDWPF Chinese continental validation: shaped vs. metered profiles for 134 Sinovel turbines at ~1,400 m elevation.

3.6 Distribution-Preserving Averaging Methods

Section 3.3 showed that naive hour-by-hour multi-year averaging destroys the generation distribution, inflating matching scores by ~10 pp. However, the underlying problem is not with multi-year data per se, but with how it is combined. Hour-by-hour arithmetic averaging narrows the wind speed distribution: a windy hour in one year gets averaged with a calm hour in another, pulling values toward the mean. This compresses the generation CDF—the very quantity that determines matching scores.

We test nine alternative averaging methods to determine which preserve the distribution shape and recover the accuracy lost by naive averaging. All methods use the same 2017–2019 ERA5 data for the same 5 representative farms as Section 3.3, with Tier 3 power curves and flat load profile. Methods involving randomness (bootstrap, Weibull sampling) are run 50 times each and averaged.

Methods tested:

Time-domain averaging (compress the CDF):

  1. Naive hour-by-hour average (wind): Average wind speeds hour-by-hour across years, then apply power curve.

  2. Hour-by-hour average (generation): Apply the power curve to each year independently, then average generation profiles hour-by-hour.

Distribution-preserving (nonparametric):

  1. Duration curve averaging: Sort each year’s generation independently, average at each rank position, and randomly assign the averaged values to hours (a code sketch of this and the Weibull method follows this list).

  2. Bootstrap resampling: Pool all ~26,280 hourly generation values, randomly sample 8,784.

  3. Rank-preserving average: Use the most recent year’s (2019) temporal ordering with duration-curve-averaged values at corresponding ranks.

  4. Seasonal duration curves: Duration curve averaging applied within each calendar month separately, preserving seasonal patterns.

Distribution-correcting (applied to the naive multi-year average):

  1. Quantile mapping: Take the naive-averaged wind profile and remap each value through the pooled historical CDF—converting each averaged wind speed to its percentile rank, then looking up the corresponding value in the single-year distribution.

  2. Variance re-inflation: Linearly rescale the naive-averaged generation deviations to restore the average single-year standard deviation.

Parametric:

  1. Weibull parameter averaging: Fit a Weibull distribution (shape \(k\), scale \(\lambda\)) to each year’s wind speeds, average the parameters, generate 8,784 synthetic wind speeds from the averaged distribution. Weibull goodness-of-fit across the 5 farms and 21 years (105 site-years) shows: shape parameter \(k\) ranges from 2.03 to 2.59 (mean 2.27, inter-annual CV ~4%), scale parameter \(\lambda\) ranges from 8.8 to 11.4 m/s (mean 10.2, inter-annual CV ~3%). K-S tests reject the Weibull null at \(p < 0.05\) for roughly two-thirds of site-years (66/105), indicating modest but systematic deviations from a pure Weibull, typically in the upper tail where ERA5’s spectral truncation compresses extremes. However, the parametric fit need only capture the distributional width accurately (which determines the generation CDF), not match the tails exactly. The low inter-annual CV in both parameters (~3–4%) explains why averaging parameters across years is effective: the wind climate is remarkably stable from year to year in both shape (\(k\), i.e., distribution breadth) and scale (\(\lambda\), i.e., mean wind speed). Note that arithmetic averaging of Weibull parameters is a heuristic—the average of two Weibull distributions is not itself Weibull—but the approximation works well because inter-annual parameter variation is small relative to the parameter values.
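
The sketch below illustrates the two headline methods, duration curve averaging and Weibull parameter averaging, assuming equal-length hourly arrays per year; the SciPy-based Weibull fit is one reasonable choice, not necessarily the one used in the study.

```python
import numpy as np
from scipy.stats import weibull_min

def duration_curve_average(gen_years, seed=None):
    """Average several years' generation at each rank of the duration curve,
    then assign the averaged values to hours in random order.

    gen_years : 2-D array, shape (n_years, n_hours), hourly generation per year
    """
    rng = np.random.default_rng(seed)
    ranked = np.sort(gen_years, axis=1)[:, ::-1]    # each year's duration curve
    avg_curve = ranked.mean(axis=0)                 # average value at each rank
    return rng.permutation(avg_curve)               # hour assignment is irrelevant for matching

def weibull_parameter_average(ws_years, n_hours=8784, seed=None):
    """Fit a Weibull to each year's wind speeds, average (k, lambda) across
    years, and draw one synthetic year from the averaged distribution."""
    params = [weibull_min.fit(ws, floc=0.0) for ws in ws_years]   # (k, loc=0, lambda)
    k_avg = float(np.mean([p[0] for p in params]))
    lam_avg = float(np.mean([p[2] for p in params]))
    return weibull_min.rvs(k_avg, scale=lam_avg, size=n_hours, random_state=seed)
```
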
| Method | Mean Error (pp) [95% CI] | Std Ratio | Category |
|---|---|---|---|
| Weibull parameter averaging | 1.9 [1.1–2.8] | — | Parametric |
| Quantile mapping | 2.2 [1.0–3.3] | 0.94 | CDF-correcting |
| Bootstrap resampling | 2.2 [1.1–3.3] | — | Distribution-preserving |
| Duration curve averaging | 2.2 [1.1–3.4] | — | Distribution-preserving |
| Rank-preserving averaging | 2.2 [1.1–3.4] | — | Distribution-preserving |
| Concurrent year (reference) | 2.3 [1.3–3.1] | 0.93 | Reference |
| Seasonal duration curves | 3.0 [1.9–4.2] | — | Distribution-preserving |
| Variance re-inflation | 5.6 [4.0–6.9] | 0.86 | Variance-correcting |
| Naive hour-by-hour avg (wind) | 10.0 [8.7–11.0] | 0.74 | Time-domain |
| Hour-by-hour avg (generation) | 16.1 [15.1–17.1] | 0.59 | Time-domain |

Std Ratio = standard deviation of shaped profile / standard deviation of metered profile. Values below 1.0 indicate distribution compression. Only shown for methods with deterministic temporal structure.

Key findings:

Five methods achieve ~2.2 pp or better—matching or exceeding the concurrent year (2.3 pp). Duration curve averaging, bootstrap resampling, rank-preserving averaging, and quantile mapping all achieve ~2.2 pp, while Weibull parameter averaging achieves 1.9 pp. All five completely eliminate the ~10 pp naive averaging penalty.

Weibull parameter averaging is the best-performing method (1.9 pp), slightly exceeding even the concurrent year. However, Weibull averaging shows a systematic positive bias: the signed mean error is +1.6 pp (4 of 5 farms positive), indicating a tendency to overstate matching scores. The concurrent year shows a similar positive bias (+2.3 pp signed mean). Weibull averaging performs well because the parametric fit averages out year-specific sampling noise in the wind speed distribution: with only two parameters per year (shape \(k\) and scale \(\lambda\)), the averaged Weibull produces a “cleaner” generation CDF than any single year’s 8,760 observed values. This result has a practical implication: for matching score purposes, the Weibull parameters from a few historical years contain more useful information than a full year of concurrent hourly wind data.

Quantile mapping restores the distribution almost perfectly. The std ratio of 0.94 (vs. 0.74 for naive averaging) shows that remapping through the historical CDF nearly completely reverses the compression. Quantile mapping also preserves the temporal structure of the naive-averaged profile (diurnal and seasonal patterns), which is irrelevant for matching score but useful for other applications.
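
A sketch of the remapping step, assuming the naive-averaged wind profile and the pooled multi-year wind speeds are available as NumPy arrays:

```python
import numpy as np

def quantile_map(naive_avg_ws, pooled_ws):
    """Remap each naive-averaged wind speed through the pooled historical CDF:
    convert it to a percentile rank, then look up that quantile of the pooled
    distribution. Preserves the averaged profile's temporal ordering while
    restoring the historical distribution width."""
    ranks = np.argsort(np.argsort(naive_avg_ws))          # 0..n-1 rank of each hour
    percentiles = (ranks + 0.5) / len(naive_avg_ws)        # empirical percentile per hour
    return np.quantile(pooled_ws, percentiles)             # corresponding pooled-CDF values
```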

Rank-preserving averaging is identical to duration curve averaging (both 2.2 pp). Whether you use the most recent year’s temporal ordering or a random shuffle, the matching score is the same. This is a direct confirmation that matching scores depend only on the generation CDF, not on which hours the values are assigned to.

Seasonal duration curves are worse than whole-year (3.0 pp vs. 2.2 pp). Constraining the averaging to within each month introduces noise when the per-month sample size is small (730 hours × 3 years per month). The whole-year duration curve, drawing on 8,760 × 3 hours, produces a more stable CDF estimate.

Averaging generation is worse than averaging wind (16.1 [15.1–17.1] pp vs. 10.0 [8.7–11.0] pp). The power curve’s nonlinearity (cubic below rated, flat above) compresses the generation distribution more than the wind distribution. Averaging the already-compressed generation values narrows the CDF further (std ratio 0.59 vs. 0.74).

Variance re-inflation is a partial fix (5.6 pp). Restoring the correct variance halves the naive averaging error but does not recover accuracy fully. The shape of the CDF—not just its width—matters: rescaling preserves the compressed shape even as it widens it.

The random hour assignment in duration curve averaging has zero effect on matching scores. Across 50 trials per farm, the standard deviation of matching error was < 0.01 pp.

Practical implication: When multi-year wind data must be combined, fit a Weibull to each year and average the parameters, or alternatively average the duration curves, not the time series. Both approaches completely eliminate the distribution compression that makes naive hour-by-hour averaging unusable for matching score calculations.

Figure 9. Averaging method comparison: mean |matching score error| for nine methods plus the concurrent year reference. Duration curve CDFs shown for representative farms.

3.7 Historical Window Length Sensitivity

The results in Section 3.6 used only 3 years of historical data (2017–2019). A natural question is: does more historical data improve accuracy? We extend the ERA5 history to 20 years (2000–2019) for the same 5 representative farms and test how averaging method accuracy changes with 3, 5, 10, 15, and 20 years of input data.

| Method | 3 yr [95% CI] | 5 yr [95% CI] | 10 yr [95% CI] | 15 yr [95% CI] | 20 yr [95% CI] |
|---|---|---|---|---|---|
| Weibull param avg | 1.9 [1.1–2.8] | 1.9 [1.4–2.6] | 1.7 [1.2–2.2] | 1.8 [1.2–2.5] | 1.7 [1.2–2.3] |
| Bootstrap resampling | 2.2 [1.1–3.3] | 2.1 [1.2–3.1] | 1.9 [1.2–2.6] | 2.1 [1.2–2.9] | 1.9 [1.1–2.6] |
| Duration curve avg | 2.2 [1.1–3.4] | 2.2 [1.1–3.3] | 2.0 [1.2–2.8] | 2.1 [1.1–3.0] | 1.9 [1.1–2.8] |
| Quantile mapping | 2.2 [1.0–3.3] | 2.1 [1.2–3.0] | 1.9 [1.2–2.6] | 2.0 [1.2–2.9] | 1.9 [1.1–2.6] |
| Concurrent year (ref) | 2.3 | 2.3 | 2.3 | 2.3 | 2.3 |
| Variance re-inflation | 5.6 [4.0–6.9] | 6.4 [4.9–7.7] | 6.6 [5.0–7.8] | 6.9 [5.3–8.2] | 6.2 [4.5–7.5] |
| Naive avg (wind) | 10.0 [8.7–11.0] | 12.7 [11.3–13.5] | 14.9 [13.6–15.8] | 16.2 [15.0–17.2] | 16.1 [14.9–17.1] |
| Naive avg (generation) | 16.1 [15.1–17.1] | 19.6 [18.4–20.5] | 22.1 [21.0–23.2] | 23.5 [22.4–24.6] | 23.7 [22.5–24.7] |

All values are mean |matching score error| in percentage points [95% bootstrap CI] across 5 farms.

Key findings:

Naive hour-by-hour averaging gets dramatically worse with more years (10 pp → 16 pp). This is the opposite of conventional wisdom: more data hurts when using time-domain averaging. Each additional year adds more hour-by-hour cancellation, further compressing the wind speed distribution. At 20 years, the naive averaging error is 60% higher than at 3 years.

Distribution-preserving methods improve modestly, with diminishing returns after ~10 years. Weibull parameter averaging improves from 1.9 pp (3 years) to 1.7 pp (10 years), then plateaus. Duration curve averaging follows the same pattern (2.2 → 2.0 → 1.9 pp). The improvement from 3 to 10 years is ~0.3 pp; from 10 to 20 years, essentially zero.

With 10+ years, all four distribution-preserving methods beat the concurrent year. At 10 years, Weibull (1.7 pp), bootstrap (1.9 pp), duration curve (2.0 pp), and quantile mapping (1.9 pp) all surpass the concurrent year reference (2.3 pp). The parametric Weibull fit benefits most because averaging more years’ Weibull parameters produces an increasingly accurate estimate of the site’s true long-term wind distribution.

Variance re-inflation remains ineffective regardless of window length (~5.6–6.9 pp). Restoring the correct width without fixing the CDF shape is insufficient at any number of years.

Practical implication: A 10-year historical window is the practical sweet spot: enough data for the parametric and nonparametric methods to beat the concurrent year, without the diminishing returns of longer histories. Even 3 years suffices for distribution-preserving methods to match concurrent-year accuracy—the method matters far more than the amount of history.

Figure 10. Historical window length sensitivity: distribution-preserving methods improve modestly with more years while naive averaging degrades dramatically (10 pp → 16 pp).

3.8 Power Curve Default Sensitivity

The results above consistently show that knowing the turbine model is the single most important factor for matching score accuracy—but turbine information is often unavailable, particularly for portfolio-level analysis across hundreds of sites. Can better default power curves close the gap between Tier 0 (generic cubic, 7.4 pp) and Tier 3 (manufacturer curve, 2.4 pp)?

The averaged manufacturer curve. We construct a single “average offshore turbine” curve by averaging the 8 unique manufacturer power curves in our dataset (each evaluated on a fine wind speed grid and normalized to capacity factor). To prevent information leakage, we use leave-one-out (LOO) validation: for each farm, the average excludes the curve used by that farm. Because many farms share the same turbine model (e.g., 7 farms use the SWT-3.6-107), a per-farm LOO still leaves curves from identical turbines in the average. We therefore also test leave-one-curve-out (LOCC) validation, which removes all instances of the farm’s turbine model from the average. LOCC produces mean |error| of 2.61 pp vs. LOO-farm’s 2.76 pp—slightly better because equal-weighting 7 unique curves produces a more diverse average than instance-weighting 26 curves. The maximum degradation from LOO to LOCC is 0.34 pp (at farms using the SWT-6.0-154, which appears in 5 farms). The LOO average achieves 2.6 pp mean |error|—closing ~96% of the gap between Tier 0 (7.4 pp) and the specific manufacturer curve (2.4 pp), and within 0.2 pp of knowing the exact turbine. Adding hub height extrapolation further improves this to 2.3 pp (75% within ±3 pp, 92% within ±5 pp).
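
A sketch of the leave-one-out curve averaging, assuming each manufacturer curve has been pre-evaluated as a capacity-factor array on a shared wind-speed grid (names are illustrative):

```python
import numpy as np

def loo_average_curve(curves_cf, exclude):
    """Leave-one-out average of manufacturer curves on a shared wind-speed grid.

    curves_cf : dict mapping turbine model -> capacity-factor array on the grid
    exclude   : model (or models) used by the farm being evaluated
    """
    exclude = {exclude} if isinstance(exclude, str) else set(exclude)
    kept = [cf for model, cf in curves_cf.items() if model not in exclude]
    return np.mean(kept, axis=0)          # equal weight per unique curve

# Hypothetical usage:
# grid = np.arange(0.0, 30.01, 0.1)
# avg_cf = loo_average_curve(offshore_curves, exclude="SWT-3.6-107")
# cf_hourly = np.interp(ws_hub, grid, avg_cf)
```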

| Tier | Description | Mean Error (pp) [95% CI] | Within ±3 pp [95% CI] | Within ±5 pp [95% CI] |
|---|---|---|---|---|
| Blind (Tier 0) | Generic cubic, SP=350, no HH | 7.4 | 0% | 17% |
| Know “offshore” | Averaged mfr curve (LOO), no HH | 2.6 [2.0–3.3] | 62% [42–79%] | 88% [75–100%] |
| Know hub height | Averaged mfr curve (LOO) + HH extrap | 2.3 [1.8–2.9] | 75% [58–92%] | 92% [79–100%] |
| Know SP class | Generic cubic at correct SP, no HH | 8.3 | 0% | 8% |
| Know turbine (Tier 3) | Manufacturer curve + HH | 2.4 | 76% | 90% |

Knowing the specific power makes things worse, not better. The “know SP class” tier—using the correct specific power with the generic cubic curve—produces 8.3 pp error, worse than blind. This confirms that the cubic curve shape is the fundamental problem: it dramatically underestimates power in the critical 5–11 m/s range regardless of its SP parameterization. The SP sensitivity sweep shows that the fleet-optimal SP is 200 W/m² (far below any real turbine), not because 200 W/m² is physically correct but because lowering SP shifts the cubic curve’s rated wind speed down toward the range where the generic shape best approximates real turbine behavior.

Hub height extrapolation: it depends on the curve. For the averaged manufacturer curve and Tier 3, hub height extrapolation provides a modest improvement (~0.3 pp) because these curves have the right shape and benefit from the refined wind speed input. For the generic cubic, extrapolation makes things worse (+0.6 pp) because the wrong curve shape amplifies rather than corrects the wind speed adjustment.

Implications. The averaged manufacturer curve effectively makes turbine-specific information optional for offshore wind. A practitioner who knows only that a farm is “offshore” can achieve 2.6 pp accuracy using a sector-average curve—comparable to full turbine knowledge (2.4 pp). This suggests that constructing technology-class-specific average curves (onshore low-wind, onshore high-wind, offshore conventional, offshore floating) could eliminate the need for turbine-specific information across most wind farms.


Figure 11. Power curve default sensitivity: tier gradation from Tier 0 (generic) to Tier 3 (manufacturer), specific power sweep, and averaged manufacturer curve overlay.

3.9 Stacking: The Full Recipe

Sections 3.4, 3.6–3.8 each showed that individual improvements—monthly scaling, Weibull averaging, and the averaged manufacturer curve—reduce matching score error independently. Here we test whether these improvements compound when applied together, using two fleet configurations:

Configuration A: 24 offshore farms, concurrent 2020 ERA5. The averaged manufacturer curve (LOO) with monthly scaling (“avg + monthly”) is tested against the averaged curve with annual scaling (“avg + annual”) and the manufacturer curve baselines. This uses the same 2020 ERA5 data as the baseline validation (Section 3.1), so no wind data synthesis is involved.

Configuration B: 4 offshore farms with 10-year ERA5 history (2010–2019). The full stack combines the averaged manufacturer curve, Weibull parameter averaging (10 years), and monthly scaling. This configuration tests the complete “blind” pipeline: no turbine information, no concurrent-year weather data, and only monthly production totals.

| Configuration | Curve | Wind Source | Scaling | Mean Error (pp) | N Farms |
|---|---|---|---|---|---|
| A | Avg mfr (LOO) | Concurrent 2020 | Annual | 2.29 | 24 |
| A | Avg mfr (LOO) | Concurrent 2020 | Monthly | 1.91 | 24 |
| A | Manufacturer | Concurrent 2020 | Annual | 2.40 | 21 |
| A | Manufacturer | Concurrent 2020 | Monthly | 2.04 | 21 |
| B | Avg mfr (LOO) | Weibull 10yr | Annual | 1.66 | 4 |
| B | Avg mfr (LOO) | Weibull 10yr | Monthly | 1.31 | 4 |
| B | Manufacturer | Weibull 10yr | Annual | 1.50 | 4 |
| B | Manufacturer | Weibull 10yr | Monthly | 1.24 | 4 |

Key findings:

Improvements compound. The averaged curve with monthly scaling (1.91 pp, 24 farms) outperforms the manufacturer curve with annual scaling (2.40 pp, 21 farms). Monthly scaling adds ~0.4 pp improvement regardless of curve choice. The full blind stack (1.31 pp, 4 farms) approaches the limits of the methodology.

The averaged curve outperforms the manufacturer curve. Across both configurations, the LOO-averaged curve consistently beats the farm-specific manufacturer curve (by ~0.2–0.3 pp). This surprising result occurs because the averaged curve, by averaging across turbine types, produces a smoother power curve that is less sensitive to ERA5 wind speed biases—analogous to how ensemble models often outperform individual members.

Onshore results are mixed. The offshore averaged curve applied to Kelmarsh (flat terrain, UK onshore) achieves 0.65 pp error—remarkably good. But Penmanshiel (complex terrain) shows 10.8 pp error regardless of curve choice, confirming ERA5’s terrain-resolution limitation (Section 4.7).

3.10 Robustness Checks

Three additional analyses test the robustness of the results above.

3.10.1 Curve Count Sensitivity

The averaged manufacturer curve in Section 3.8 uses 8 unique offshore power curves. Would fewer curves suffice? We test subsets of \(k = 3, 4, \ldots, 8\) curves (50 random draws per \(k\), LOO validation applied within each subset) on the 24-farm fleet:

| Curves in Average (\(k\)) | Mean Error (pp) | Std Across Draws (pp) | Range Across Draws (pp) |
|---|---|---|---|
| 3 | 2.31 | 0.25 | 1.79–2.78 |
| 4 | 2.27 | 0.19 | 1.87–2.67 |
| 5 | 2.29 | 0.14 | 1.97–2.63 |
| 6 | 2.28 | 0.11 | 2.05–2.53 |
| 7 | 2.30 | 0.09 | 2.15–2.42 |
| 8 (all) | 2.29 | — | — |

Even 3 curves produce 2.3 pp accuracy—essentially identical to 8. The mean barely changes with \(k\); only the variance decreases (from ±0.25 pp at \(k=3\) to ±0.09 pp at \(k=7\)). This is because the critical 5–11 m/s region of the power curve is similar across modern offshore turbines regardless of size class: all exhibit steep initial ramp, gradual approach to rated power, and a plateau. The worst-case draw at \(k=3\) (2.78 pp) is still better than Tier 0 by a factor of 2.7. This result implies that the averaged curve methodology does not require an extensive turbine library—a handful of representative curves from any modern offshore turbines would suffice.

3.10.2 Out-of-Sample Year (2021)

The baseline validation uses 2020 as both the ERA5 wind year and the metered validation target. To test out-of-sample generalization, we validate against 2021 metered production for 5 farms with 10-year ERA5 history. Weibull parameters come from 2010–2019 (no 2021 ERA5 is used). We also test “stale” 2020 ERA5 as a near-concurrent proxy (1 year old).

| Method | 2020 Mean Error (pp) | 2021 Mean Error (pp) |
|---|---|---|
| Avg + concurrent ERA5 + annual | 1.84 | 5.63 (stale 2020 ERA5) |
| Avg + concurrent ERA5 + monthly | 1.51 | 5.09 (stale 2020 ERA5) |
| Avg + Weibull 10yr + annual | 1.66 | 4.39 |
| Avg + Weibull 10yr + monthly | 1.31 | 3.34 |

2021 errors are systematically higher than 2020 by 2–4 pp. Beatrice shows the largest degradation (0.67 → 6.83 pp), Galloper is excellent in 2021 (0.91 pp; it lacks a 2020 baseline), and Westermost Rough moves in the opposite direction (3.90 → 0.75 pp). The mean 2021 full-stack error (3.34 pp) is higher than 2020 (1.31 pp) but still within ±5 pp for 4 of 5 farms.

Interpretation: The 2020 results benefit from partial in-sample fitting: the ERA5 data for the concurrent year captures the actual wind regime that produced the metered output. The 2021 results—where no 2021 weather data is used—represent the genuinely blind scenario. The increase from 1.3 to 3.3 pp is expected: it reflects year-to-year variability in wind patterns relative to the 10-year climatology. Beatrice’s large 2021 error likely reflects an atypical wind year or operational changes (curtailment, outages) not captured by the climatological Weibull. The 3.3 pp out-of-sample result is a more conservative (and arguably more honest) estimate of the methodology’s accuracy in a truly prospective application.

3.10.3 Bootstrap Confidence Intervals

All headline results throughout this paper include 95% bootstrap confidence intervals computed by resampling the farm set (10,000 iterations with replacement). Key intervals for the flat load profile:
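
A sketch of the resampling procedure, assuming one signed matching-score error per farm:

```python
import numpy as np

def bootstrap_ci(per_farm_errors, n_boot=10_000, alpha=0.05, seed=None):
    """95% bootstrap CI on the fleet-mean |matching score error|,
    resampling farms with replacement."""
    rng = np.random.default_rng(seed)
    errors = np.abs(np.asarray(per_farm_errors, dtype=float))
    means = np.array([
        rng.choice(errors, size=errors.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return errors.mean(), np.quantile(means, [alpha / 2, 1 - alpha / 2])
```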

| Method | Mean Error (pp) [95% CI] |
|---|---|
| Blind (Tier 0, generic cubic) | 7.38 [6.54, 8.25] |
| Averaged mfr curve (LOO, no HH) | 2.61 [2.00, 3.25] |
| Averaged mfr curve + HH | 2.29 [1.77, 2.87] |
| Manufacturer curve (Tier 3) | 2.40 [1.76, 3.07] |
| Avg + concurrent + monthly | 1.91 [1.44, 2.44] |

The CIs confirm that the averaged manufacturer curve (2.61 [2.00–3.25]) and the specific manufacturer curve (2.40 [1.76–3.07]) have broadly overlapping confidence intervals, suggesting no statistically meaningful difference between the two, consistent with the curve count sensitivity finding that the exact composition of the average barely matters.



4. Discussion

4.1 What Matters for Hourly Matching

Our results reveal a striking hierarchy of importance for matching score accuracy:

  1. Power curve shape (most important): The difference between generic and manufacturer curves (4.5 pp) dominates all other factors. However, an averaged manufacturer curve for the offshore sector closes ~96% of this gap without any turbine-specific knowledge (Section 3.8), reducing the effective penalty for unknown turbines from 5.0 pp to 0.2 pp. This is because matching scores are determined by the distribution of hourly generation values—how many hours are at high vs. low capacity factor—which is directly controlled by the power curve’s shape, and manufacturer curves cluster tightly enough that their average is nearly as good as the specific curve.

  2. Averaging method (critical if using multi-year data): Naive hour-by-hour averaging destroys natural variability and inflates matching scores by ~10 pp—worsening to 16 pp with 20 years of input data. Distribution-preserving methods completely eliminate this penalty and improve with more data. Weibull parameter averaging reaches 1.7 pp with 10+ years, surpassing concurrent-year accuracy (2.3 pp). The method matters far more than the amount of history: 3 years with Weibull averaging (1.9 pp) beats 20 years of naive averaging (16.1 pp) by a factor of 8.

  3. Scaling period (moderately important): Monthly vs. annual scaling adds ~0.4 pp improvement in mean error and +8 pp in ±3 pp compliance rate.

  4. Spatial precision (least important): Matching scores are robust to 100 km spatial displacement (<0.5 pp additional error).

These improvements compound: the full stack (averaged curve + Weibull 10yr + monthly scaling) achieves 1.3 pp on the 4-farm subset with extended ERA5 history—surpassing even the manufacturer curve with concurrent ERA5 (2.4 pp). Out-of-sample validation against 2021 (3.3 pp) provides a more conservative estimate but confirms the methodology remains within the ±5 pp threshold.

This hierarchy is fundamentally different from what matters for energy yield estimation, where spatial and temporal precision are paramount and power curve shape is secondary.

Figure 13. Synthesis: six-panel summary of all findings — power curve dominance, spatial robustness, temporal robustness, averaging method sensitivity, years sensitivity, and the CDF-not-timing insight.

4.2 Correlation vs. Matching Score

Our most robust cross-cutting finding is that hourly correlation does not predict matching score accuracy. This is demonstrated across three independent tests: spatial substitution, where correlation decays with distance far faster than matching error (Section 3.2); temporal substitution, where prior-year wind data with near-zero hourly correlation still yields only 1–4 pp error (Section 3.3); and the SDWPF continental site, where Tier 3 stays within ±5 pp despite r = 0.76 (Section 3.5).

This decoupling occurs because matching score is a min-sum operator that depends on the fraction of time generation exceeds load, not on which specific hours this occurs. Two profiles with identical CDFs but completely different timing will produce identical matching scores.

How far does this go? A shuffle test (Figure 12a). To quantify precisely how much timing matters, we randomly shuffle a farm’s hourly generation 1,000 times — preserving the CDF exactly but destroying all temporal structure — and recompute matching scores against each load profile. For Hornsea One: the flat matching score has exactly zero variance across shuffles (mathematically, it is a pure CDF statistic). The commercial profile shows the largest spread (std = 0.29 pp, range ~2 pp), followed by residential (0.20 pp) and industrial (0.11 pp). Even for the most structured load profile, timing affects a single farm’s matching score by less than 0.3 pp — an order of magnitude smaller than the 2.8 pp allocation error from power curve uncertainty. The CDF dominates overwhelmingly for single-farm evaluation.
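
A sketch of the shuffle test, assuming hourly generation and load arrays for one farm-year:

```python
import numpy as np

def shuffle_test(gen, load, n_shuffles=1000, seed=None):
    """Randomly permute the hourly generation profile (preserving its CDF,
    destroying its timing) and report the spread of matching scores in pp."""
    rng = np.random.default_rng(seed)
    match = lambda g: np.minimum(g, load).sum() / load.sum()
    scores = np.array([match(rng.permutation(gen)) for _ in range(n_shuffles)])
    return 100 * scores.std(), 100 * (scores.max() - scores.min())
```

For a flat load the returned spread is exactly zero, since the score depends only on the sorted generation values.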

Where correlation does matter: choosing between farms and portfolio construction. The irrelevance of correlation applies to validating a single farm’s shaped allocation — the question “is this profile accurate enough?” But for procurement decisions — “which farm should I buy from?” — different farms have different matching score levels against the same load (Table 1b shows 5–15 pp differences across load profiles), and these level differences are driven by each farm’s generation-load temporal alignment. Correlation also matters for portfolio diversification: nearby offshore farms are highly correlated (r = 0.89 within 30 km; Figure 12b), limiting diversification within a region. Farms 300+ km apart (r ≈ 0.54) offer meaningful diversification, producing smoother combined output and higher portfolio-level matching scores — a property that CDF agreement alone cannot capture.

Implication for standard-setting: Validation frameworks for shaped allocation should evaluate CDF agreement and matching score accuracy directly, rather than relying on correlation metrics. However, frameworks for comparing or ranking generation assets — and for assessing portfolio-level matching — should retain temporal correlation as a criterion.

Figure 12. (a) Shuffle test: randomly permuting hourly generation preserves the CDF — flat matching score has zero variance, commercial shows only 0.29 pp std. (b) Inter-farm hourly correlation decays with distance: r = 0.89 within 30 km to r = 0.54 at 300+ km.

4.3 Practical Recommendations

Based on our findings, we propose a tiered recommendation for practitioners:

| Scenario | Recommended Approach | Expected Error |
|---|---|---|
| Full information (turbine model + concurrent year + exact location) | Tier 3, annual scaling | ±3 pp (70% of farms) |
| Known turbine + recent year + approximate location | Tier 3, any individual year within 5 years | ±4 pp |
| Known turbine + 3 years historical data | Tier 3, Weibull parameter averaging | ±1.9 pp |
| Known turbine + 10+ years historical data | Tier 3, Weibull parameter averaging | ±1.7 pp |
| Known turbine + only monthly totals | Tier 3, monthly scaling | ±2.4 pp |
| Unknown turbine + know “offshore” + hub height | Averaged mfr curve + HH extrapolation | ±2.3 pp |
| Unknown turbine + know “offshore” + monthly totals | Averaged mfr curve + monthly scaling | ±1.9 pp |
| Unknown turbine + know “offshore” + 10yr historical + monthly totals | Full stack (avg curve + Weibull + monthly) | ±1.3 pp* |
| Unknown turbine + know “offshore” | Averaged mfr curve, no HH | ±2.6 pp |
| Unknown turbine + no sector info | Generic cubic (Tier 0) | ±7.4 pp |

*Based on 4-farm subset with extended ERA5; out-of-sample 2021 estimate is ±3.3 pp.

Key recommendation: If the turbine model is unknown, an averaged manufacturer curve for the relevant technology class (e.g., “offshore wind”) achieves accuracy within 0.2 pp of knowing the exact turbine. This eliminates the need for turbine-specific information in most practical scenarios. If no sector information is available, accuracy degrades significantly—the shape of the power curve matters far more than any other parameter.

Figure 14. Practitioner decision guide: expected matching score accuracy (with 95% bootstrap CIs) based on available information. Color-coded by accuracy band.

4.4 Relationship to Existing Wind Modeling Tools

Several platforms already provide ERA5-based hourly wind generation profiles, notably Renewables.ninja (Staffell & Pfenninger 2016), which offers bias-corrected simulations with a library of 10,000+ turbine models, and the windatlas.xyz platform (Hayes et al. 2021) for offshore wind. Commercial tools (DNV Windographer, Vortex, EMD WindPRO) provide similar capabilities with proprietary bias correction and wake modeling.

This paper’s contribution is not the GEE-based simulation pipeline per se, but rather the validation framework and the insight about what determines matching score accuracy. Specifically: (a) we demonstrate that matching score accuracy depends on the generation CDF rather than temporal correlation—a finding that applies regardless of which weather platform is used; (b) we show that distribution-preserving averaging methods are essential when combining multi-year data, and that naive averaging produces 10+ pp errors—a methodological finding relevant to any tool that constructs long-term average profiles; and (c) we establish that an averaged sector-specific power curve achieves accuracy within 0.2 pp of the specific curve, reducing the need for the detailed turbine libraries that differentiate commercial platforms. These insights are platform-agnostic and would apply equally to profiles generated by Renewables.ninja, ERA5 downloads from the Climate Data Store, or any other reanalysis-based approach.

4.5 Deployment Considerations

Gaming risk: Shaped allocation requires a power curve and a production total. A generator could strategically select a power curve (or averaging period) that maximizes their matching score. Mitigations include: requiring turbine model disclosure at registration, cross-referencing against public manufacturer curve libraries (e.g., IEC 61400-12-1 certified curves; IEC 2022), and verifying claimed totals against registry or settlement data. The ERA5 + GEE pipeline itself is fully reproducible—anyone can independently verify the weather data and shaped profile. The power curve is the weak link in auditability.

What this paper does not claim: We do not validate ERA5 for generation forecasting (predicting future output). We do not propose replacing metered data where it exists. We do not address curtailment, negative pricing, or balancing market effects on generation profiles. Shaped allocation is specifically for reconstructing hourly profiles when only periodic totals are available.

4.6 Stakeholder Implications

Certificate issuers: Tier 3 shaped allocation (2.8 [2.0–3.8] pp mean error) is sufficient for granular certificate allocation when metered hourly data is unavailable. Issuers should require turbine model disclosure and use the manufacturer power curve. Where the turbine model is unknown but the technology class is known (e.g., “offshore wind”), an averaged manufacturer curve achieves 2.6 [2.0–3.3] pp. Issuers should specify duration-curve averaging or Weibull parameter averaging if historical multi-year data is permitted.

Regulators: The ±3 pp accuracy band demonstrated here can inform tolerance thresholds for shaped allocation in hourly matching frameworks. Frameworks should explicitly specify the permitted averaging method for historical data, as naive averaging produces 10+ pp error while distribution-preserving methods achieve <2 pp. Validation frameworks should evaluate CDF agreement rather than correlation.
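One way a framework could operationalize CDF agreement is a duration-curve gap metric rather than an hourly Pearson correlation. The helper below is an illustrative assumption, not a metric prescribed by this paper or any existing standard.

```python
# Illustrative (assumed) CDF-agreement metric: mean absolute gap between the
# shaped and metered duration curves, in percentage points of capacity factor.
import numpy as np

def duration_curve_gap_pp(shaped_cf, metered_cf):
    shaped_dc = np.sort(np.asarray(shaped_cf))[::-1]    # shaped duration curve
    metered_dc = np.sort(np.asarray(metered_cf))[::-1]  # metered duration curve
    return 100.0 * np.mean(np.abs(shaped_dc - metered_dc))
```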

Corporate buyers: For portfolio-level procurement decisions, farm-level errors partially cancel across uncorrelated sites—portfolio matching score accuracy will be better than individual-farm accuracy. For single-site decisions where precision matters (e.g., co-location with a data center), metered data should be preferred. A worked example: a 500 MW offshore farm with SWT-7.0-154 turbines and 2,800 GWh annual production could use shaped allocation achieving ±3 pp matching score accuracy, requiring only the turbine model and annual total.
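A minimal sketch of that worked example, assuming hourly hub-height wind speeds derived from ERA5 are already available; the power-curve knots and turbine count are illustrative placeholders, not the certified SWT-7.0-154 curve.

```python
# Sketch of Tier 3 shaped allocation for the worked example. `hub_wind_speed`
# stands in for an ERA5-derived hourly hub-height wind speed series; the power
# curve knots below are rough placeholders, NOT the certified SWT-7.0-154 curve.
import numpy as np

def shaped_profile(hub_wind_speed, curve_ws, curve_mw, n_turbines, reported_total_mwh):
    """Power curve -> hourly farm output (MW), rescaled to the reported energy total."""
    farm_mw = n_turbines * np.interp(hub_wind_speed, curve_ws, curve_mw)
    # Single multiplicative factor so the shaped profile sums to the reported total
    return farm_mw * (reported_total_mwh / farm_mw.sum())

curve_ws = np.array([0, 3, 5, 7, 9, 11, 13, 25, 25.01])          # m/s
curve_mw = np.array([0, 0, 1.0, 2.8, 5.2, 6.8, 7.0, 7.0, 0.0])   # MW, cut-out above 25 m/s

hub_wind_speed = 10.0 * np.random.default_rng(1).weibull(2.1, 8760)  # stand-in for ERA5
profile = shaped_profile(hub_wind_speed, curve_ws, curve_mw,
                         n_turbines=72, reported_total_mwh=2_800_000)
print(profile.sum())  # matches the 2,800 GWh reported total by construction
```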

4.7 Limitations

  1. Geographic scope: Our offshore results are concentrated in the UK North Sea. Different wind regimes (tropical, monsoon-driven) may behave differently.

  2. Wind-only analysis: We validate only wind power. Solar power shaped allocation may have different sensitivities, particularly for spatial and temporal resolution.

  3. ERA5 resolution: The 27 km grid (effective resolution ~60–80 km) cannot resolve terrain effects below this scale. Our Penmanshiel result (+10.8 pp error at Tier 3) demonstrates this limitation for complex terrain.

  4. Single scaling factor: Our monthly/quarterly scaling uses a single multiplicative factor per period. More sophisticated methods (e.g., pattern scaling, quantile mapping) might improve results but add complexity.

  5. Single primary year (2020): The baseline validation uses 2020 as the primary target year. Out-of-sample validation against 2021 (Section 3.10.2) shows higher errors (3.3 pp vs. 1.3 pp for the full stack), suggesting that the 2020 results benefit from partial in-sample fitting. The 2021 result is a more conservative estimate of prospective accuracy, but is based on only 5 farms.

  6. Unmodeled operational effects: The ERA5-based model does not capture curtailment (especially relevant for spring 2020, when COVID-era demand drops led to significant UK wind curtailment), wake losses (direction-dependent, stability-dependent, and highly variable across large farms), or forced outages. Shaped allocation’s multiplicative scaling absorbs symmetric capacity losses but not their temporal pattern. If curtailment is concentrated in high-wind hours, the metered duration curve is compressed relative to the modeled one, with ambiguous effects on matching score accuracy.


5. Conclusions

For flat (baseload) consumption profiles, hourly matching scores depend primarily on the cumulative distribution of generation values rather than on which specific hours are windy. This relationship also holds approximately for the time-varying load profiles (commercial, residential, industrial) tested in this study. This insight explains the otherwise counterintuitive results throughout this study: why a wrong year’s wind data works nearly as well as the concurrent year (1–4 pp vs. 2.3 pp), why 100 km of spatial displacement barely matters (<0.5 pp), why naive averaging destroys accuracy (10 [8.7–11.0] pp) while distribution-preserving methods improve it (1.7 [1.2–2.2] pp at 10 years)—and why the power curve shape dominates all other factors.

With manufacturer power curves (Tier 3), ERA5 shaped allocation achieves 2.8 [2.0–3.8] pp mean error (70% [48–87%] of farms within ±3 pp, 87% [74–100%] within ±5 pp; 95% bootstrap CIs). An averaged manufacturer curve for the offshore sector achieves 2.6 [2.0–3.3] pp without any turbine-specific knowledge. Combining all improvements—averaged curve, Weibull wind synthesis, monthly scaling—produces 1.9 [1.4–2.4] pp on 24 farms with concurrent wind data, and 1.3 pp (4 farms) with the full blind pipeline using 10-year historical ERA5 only. Out-of-sample validation against 2021 metered data (no 2021 ERA5 used) yields 3.3 pp—higher than the in-sample 2020 result but still within the ±5 pp threshold for 4 of 5 farms. The averaged curve is robust to its composition: even 3 of 8 available curves produce 2.3 pp accuracy. A preliminary test on a Chinese continental site (SDWPF) confirms Tier 3 accuracy transfers cross-geography (±4.5 pp), though broader validation across wind regimes is needed. Solar power shaped allocation is out of scope but the distribution-not-timing insight may transfer, as solar matching scores similarly depend on the generation CDF.

When combining multi-year historical wind data, the averaging method matters far more than the amount of data. Weibull parameter averaging achieves 1.9 [1.1–2.8] pp with 3 years and 1.7 [1.2–2.2] pp with 10+ years—surpassing the concurrent year (2.3 [1.3–3.1] pp). Naive hour-by-hour averaging worsens from 10 pp to 16 pp over the same range. A 10-year historical window is the practical sweet spot. The recommendation is: average the distribution parameters, not the time series.
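A minimal sketch of that recommendation, assuming hourly hub-height wind speeds are available for each historical year and using scipy’s weibull_min for the fit (the study’s own implementation details may differ): fit a Weibull distribution per year, average the shape and scale parameters, and synthesize a single representative year of wind speeds to feed through the power curve and scaling steps.

```python
# Sketch of "average the distribution parameters, not the time series":
# fit a Weibull per historical year, average shape and scale, then synthesize
# one representative year of wind speeds to pass through the power curve.
import numpy as np
from scipy.stats import weibull_min

def weibull_averaged_wind(yearly_speeds, hours=8760, seed=0):
    shapes, scales = [], []
    for speeds in yearly_speeds:
        k, _, lam = weibull_min.fit(speeds, floc=0)  # fix location at zero
        shapes.append(k)
        scales.append(lam)
    rng = np.random.default_rng(seed)
    # The synthesized series has arbitrary hourly ordering; for flat loads this is
    # acceptable because the matching score depends only on the generation CDF.
    return weibull_min.rvs(np.mean(shapes), loc=0, scale=np.mean(scales),
                           size=hours, random_state=rng)
```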

For hourly energy matching, knowing the turbine’s power curve reduces allocation error from 7 pp to 3 pp, while knowing the exact location or year barely matters—because, at least for baseload and near-baseload consumption profiles, matching scores depend primarily on the shape of the generation distribution rather than on which hours are windy.


Appendix: Supporting Figures

Figure A1. Tier progression for onshore farms (Kelmarsh and Penmanshiel).
Figure A2. Pearson hourly correlation by farm across all tiers.
Figure A3. Error and correlation by terrain type (offshore, flat onshore, complex onshore).
Figure A4. Matching score error vs. hourly correlation — demonstrating their decoupling.
Figure A5. Color-coded summary results table across all farms and tiers.
Figure A6. Weekly time series snapshots for representative farms.
Figure A7. Hourly scatter plots: shaped vs. metered capacity factor for four farms.
Figure A8. Diurnal and seasonal bias patterns in shaped allocation.

Figures

  1. Matching score error by farm (Tier 0)
  2. Tier progression comparison (onshore farms)
  3. Pearson correlation by farm
  4. Error and correlation by terrain type
  5. Load profile sensitivity
  6. Error vs. correlation scatter
  7. Summary results table (color-coded)
  8. Tier 0 vs. Tier 3 comparison (headline figure)
  9. Duration curves: shaped vs. metered (4 representative farms)
  10. Weekly time series snapshots
  11. Hourly scatter plots (4 farms)
  12. Diurnal and seasonal patterns
  13. Spatial sensitivity (3-panel: error, correlation, degradation vs. distance)
  14. Scaling period sensitivity (annual, quarterly, monthly)
  15. Temporal sensitivity (concurrent vs. prior years vs. multi-year average)
  16. SDWPF Chinese continental validation
  17. Synthesis figure (6-panel summary of all findings)
  18. Alternative averaging methods comparison (method errors + duration curve CDF)
  19. Historical window length sensitivity (all methods vs. years, zoomed distribution-preserving)
  20. Power curve default sensitivity (tier gradation, SP sweep, curve overlay)
  21. Decision guide: What information do you have? → Expected accuracy (with 95% CIs)
  22. Shuffle test: Correlation irrelevant for allocation, essential for procurement (shuffle distributions + inter-farm correlation vs. distance)
  23. Curve count sensitivity (mean error vs. number of curves in average)
  24. 2020 vs. 2021 out-of-sample comparison (per-farm error, full stack)

References

Davidson, M. R., & Millstein, D. (2022). Limitations of reanalysis data for wind power applications. Applied Energy, 126, 118905.

EnergyTag. (2022). Granular Certificate Scheme Standard v1.0.

EnergyTag. (2024). GC Scheme Standard V2.

European Commission. (2023). Commission Delegated Regulation (EU) 2023/1184 supplementing Directive (EU) 2018/2001. Official Journal of the European Union, L 157/11.

European Parliament. (2023). Directive (EU) 2023/2413 amending Directive (EU) 2018/2001 (RED III). Official Journal of the European Union, L 2023/2413.

Gandoin, R., & Garza, D. (2024). Underestimation of strong winds offshore in ERA5: Evidence from long-term tall mast observations. Wind Energy Science, 9, 1727–1745.

Google. (2021). 24/7 Carbon-Free Energy: Methodologies and Metrics.

Gruber, K., et al. (2022). Towards global validation of wind power simulations: A multi-country assessment of wind power simulation from MERRA-2 and ERA5. Environmental Research Letters, 17(11), 114004.

Gualtieri, G. (2022). Reliability of ERA5 reanalysis data for wind resource assessment: A comparison against tall towers. Energies, 14(14), 4169.

Hayes, L., Stocks, M., & Blakers, A. (2021). Accurate long-term power generation model for offshore wind farms. Renewable Energy, 177, 1190–1205.

Hersbach, H., et al. (2020). The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730), 1999–2049.

IEC. (2022). IEC 61400-12-1:2022. Wind energy generation systems — Part 12-1: Power performance measurements of electricity producing wind turbines. Ed 3.0.

Olauson, J. (2018). ERA5: The new champion of wind power modelling? Renewable Energy, 126, 322–331.

Peña-Sánchez, Y., et al. (2025). A global validation of ERA5 reanalysis wind speed data. Energy, 315, 134289.

Ramon, J., Lledó, L., Torralba, V., Soret, A., & Doblas-Reyes, F. J. (2019). What global reanalysis best represents near-surface winds? Quarterly Journal of the Royal Meteorological Society, 145(724), 3236–3251.

Riepin, I., et al. (2025). 24/7 carbon-free electricity procurement accelerates clean technology adoption. Joule, 9(2), 101808.

Staffell, I., & Pfenninger, S. (2016). Using bias-corrected reanalysis to simulate current and future wind power output. Energy, 114, 1224–1239.

Xu, Q., et al. (2024). System-level impacts of voluntary 24/7 carbon-free electricity procurement. Joule, 8(2), 374–400.